This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It contains analysis such as data exploration, summary statistics and building the time series models. The final report was completed on Wed Sep 27 00:45:06 2023.
Data Description:
This dataset contains the daily count of rental bike transactions between years 2011 and 2012 in Capital bikeshare system with the corresponding weather and seasonal information.
Bike-sharing rental process is highly correlated to the environmental and seasonal settings. For instance, weather conditions, precipitation, day of week, season, hour of the day, etc. can affect the rental behaviors.
Bike sharing systems are new generation of traditional bike rentals where whole process from membership, rental and return back has become automatic. Through these systems, user is able to easily rent a bike from a particular position and return back at another position. Currently, there are about over 500 bike-sharing programs around the world which is composed of over 500 thousands bicycles. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues.
More information on the dataset can be gotten from this:
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
Relevant Paper:
Fanaee-T, Hadi, and Gama, Joao. Event labeling combining ensemble detectors and background knowledge, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg
## Import required packages
install.packages('stats')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
## Warning: package 'stats' is a base package, and should not be updated
install.packages('forecast')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages('highcharter')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(stats)
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(highcharter)
## Loading the datasets
data <- read.csv("hour.csv")
print(head(data))
## instant dteday season yr mnth hr holiday weekday workingday weathersit
## 1 1 2011-01-01 1 0 1 0 0 6 0 1
## 2 2 2011-01-01 1 0 1 1 0 6 0 1
## 3 3 2011-01-01 1 0 1 2 0 6 0 1
## 4 4 2011-01-01 1 0 1 3 0 6 0 1
## 5 5 2011-01-01 1 0 1 4 0 6 0 1
## 6 6 2011-01-01 1 0 1 5 0 6 0 2
## temp atemp hum windspeed casual registered cnt
## 1 0.24 0.2879 0.81 0.0000 3 13 16
## 2 0.22 0.2727 0.80 0.0000 8 32 40
## 3 0.22 0.2727 0.80 0.0000 5 27 32
## 4 0.24 0.2879 0.75 0.0000 3 10 13
## 5 0.24 0.2879 0.75 0.0000 0 1 1
## 6 0.24 0.2576 0.75 0.0896 0 1 1
## checking quality of data before analysis
# Checking if there are any missing values in the data frame (hour_df and day_df)
cat("Are there any missing data in the dataframe?: ", anyNA(data), "\n")
## Are there any missing data in the dataframe?: FALSE
print(summary(data))
## instant dteday season yr
## Min. : 1 Length:17379 Min. :1.000 Min. :0.0000
## 1st Qu.: 4346 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median : 8690 Mode :character Median :3.000 Median :1.0000
## Mean : 8690 Mean :2.502 Mean :0.5026
## 3rd Qu.:13034 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :17379 Max. :4.000 Max. :1.0000
## mnth hr holiday weekday
## Min. : 1.000 Min. : 0.00 Min. :0.00000 Min. :0.000
## 1st Qu.: 4.000 1st Qu.: 6.00 1st Qu.:0.00000 1st Qu.:1.000
## Median : 7.000 Median :12.00 Median :0.00000 Median :3.000
## Mean : 6.538 Mean :11.55 Mean :0.02877 Mean :3.004
## 3rd Qu.:10.000 3rd Qu.:18.00 3rd Qu.:0.00000 3rd Qu.:5.000
## Max. :12.000 Max. :23.00 Max. :1.00000 Max. :6.000
## workingday weathersit temp atemp
## Min. :0.0000 Min. :1.000 Min. :0.020 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:1.000 1st Qu.:0.340 1st Qu.:0.3333
## Median :1.0000 Median :1.000 Median :0.500 Median :0.4848
## Mean :0.6827 Mean :1.425 Mean :0.497 Mean :0.4758
## 3rd Qu.:1.0000 3rd Qu.:2.000 3rd Qu.:0.660 3rd Qu.:0.6212
## Max. :1.0000 Max. :4.000 Max. :1.000 Max. :1.0000
## hum windspeed casual registered
## Min. :0.0000 Min. :0.0000 Min. : 0.00 Min. : 0.0
## 1st Qu.:0.4800 1st Qu.:0.1045 1st Qu.: 4.00 1st Qu.: 34.0
## Median :0.6300 Median :0.1940 Median : 17.00 Median :115.0
## Mean :0.6272 Mean :0.1901 Mean : 35.68 Mean :153.8
## 3rd Qu.:0.7800 3rd Qu.:0.2537 3rd Qu.: 48.00 3rd Qu.:220.0
## Max. :1.0000 Max. :0.8507 Max. :367.00 Max. :886.0
## cnt
## Min. : 1.0
## 1st Qu.: 40.0
## Median :142.0
## Mean :189.5
## 3rd Qu.:281.0
## Max. :977.0
This Analysis is centered around the variable ‘cnt’ which tells us the total count of registered users per hour(hour_df) or per day(day_df).
The summary statistic for the ‘cnt’ variable shows that;
The minimum registered users on a particular day is 1
Has an average number of daily registered users of 190.
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
holiday_day <- data %>% group_by(holiday) %>% summarize(registered_users = sum(cnt))
holiday_day <- holiday_day %>% mutate(holiday = recode(holiday, `0` = "No", `1` = "Yes"))
print(holiday_day)
## # A tibble: 2 × 2
## holiday registered_users
## <chr> <int>
## 1 No 3214244
## 2 Yes 78435
Lets put this in a visualization to better present it
barplot(holiday_day$registered_users, names.arg = holiday_day$holiday, col = "blue", main = "Effects of holiday on number of registered users", xlab = "Holiday", ylab = "Number of registered Users")
The seasonality (Holidays) really affects the number of users that registers. The difference between the Number of Registered Users when there are no holidays and Number of Registered users when there are holidays is very large.
The seasonality of holidays, i.e there are holidays, contributes 2.4% to the number of users registered in that program.
This explains that, during holidays, most percentage of people don’t go out for businesses or work.
Now, lets go over the relationship between the seasons of a year and the number of registered users. Does the season have any effect on the number of users registering? well, let’s quench the curiosity by actually looking at how each of them contributes to the number of registered users.
Since the seasons variable contains discrete values, we would be using the barplot to show the relationship.
## grouping the data
seasons <- data %>% group_by(season) %>% summarize(registered_users = sum(cnt))
seasons <- seasons %>% mutate(season = recode(season, `1` = "spring", `2` = "summer", `3` = "fall", `4` = "winter"))
max_value = max(seasons$registered_users)
barplot(seasons$registered_users, names.arg = seasons$season, col = "purple", main = "Effects of Seasons on number of registered users", ylim = c(0, max_value+200000), xlab = "seasons", ylab = "Number of registered Users")
During the spring seasons, there are low registrations, and during the fall season, the number of registration is at its peak than any other seasons.
Research tells us that spring season is the period after winter, i.e, it is the transmission period from cold to warm, whereby, fall seasons(also known as autumn) is the transmission period from warm to cold.
This supports our analysis of number of registered users by seasons. At the spring season, the temperature is changing from cold to warm, this indicates that a whole population of people don’t ride bikes on a sunny or hot period, because of health challenges and some other factors. While, looking at the fall season(autumn), it has the highest number of registered users, this is because there is a change in the temperature from warm(hot) to cold.
To put it all together: - Users registers more during a cold season than hot(warm) seasons. - And, seasons affect the number of registered users.
plot(x = data$cnt, y = data$temp, main='Relationship Between Temp and Users registration', xlab='No of registered users', ylab='temperature in celscius')
The temperature does not neccesarily affect the registration rate of users. The visualization shows that there is no correlation between the number of registered users and temperature of that day.
# installing packages
install.packages('timetk')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages('plotly')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
install.packages('lubridate')
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
#loading packages
library(plotly)
## Loading required package: ggplot2
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(timetk)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
# getting the time series data
data_time = data[c('dteday', 'cnt')]
# Converting the date variable to a date format
data_time$dteday <- as.Date(data_time$dteday)
# Converting the Date column to a time series object using timetk
data_time <- tk_tbl(data_time, date_var = "dteday")
## Warning in tk_tbl.data.frame(data_time, date_var = "dteday"): Warning: No index
## to preserve. Object otherwise converted to tibble successfully.
# Create an interactive time series plot using plotly
plot_ly(data_time, x = ~dteday, y = ~cnt, type = "scatter", mode = "lines") %>%
layout(title = "Interactive Time Series Plot", xaxis = list(title = "date"), yaxis = list(title = "number of registered users"))
Smoothing time series data involves reducing noise or fluctuations in the data to reveal underlying trends and patterns more clearly.
There are several methods to smooth time series data. Here we would be performing the Holt-Winters exponential smoothing.
install.packages("forecast")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(forecast)
# Perform Holt-Winters smoothing
smoothed_data <- ets(data_time$cnt, model = 'ZZZ')
data_time_smooth <- data_time
data_time_smooth$cnt <- fitted(smoothed_data)
plot(data_time$dteday, data_time$cnt, type = 'l', col = 'red', xlab = 'Date', ylab = 'number of registered users')
lines(data_time$dteday, fitted(smoothed_data), type = 'l', col = 'blue')
legend('topleft', legend = c('Original', 'Smoothed'), col = c('red', 'blue'), lty = 1)
install.packages("tseries")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.3'
## (as 'lib' is unspecified)
library(tseries)
# Checking for stationarity using the Augmented Dickey-Fuller test
adf_test <- adf.test(data_time_smooth$cnt)
## Warning in adf.test(data_time_smooth$cnt): p-value smaller than printed p-value
print(adf_test)
##
## Augmented Dickey-Fuller Test
##
## data: data_time_smooth$cnt
## Dickey-Fuller = -9.5484, Lag order = 25, p-value = 0.01
## alternative hypothesis: stationary
We see that the p-value is lower than the significance level of 0.05, thereby rejecting the null hypothesis(not stationary) and accepting the alternative hypothesis(stationary).
Therefore, we conclude that the data is stationary.
# fitting an arima model
arima_model <- auto.arima(data_time_smooth$cnt)
print(arima_model)
## Series: data_time_smooth$cnt
## ARIMA(2,1,1)
##
## Coefficients:
## ar1 ar2 ma1
## 0.4504 -0.3268 -0.0647
## s.e. 0.0235 0.0095 0.0249
##
## sigma^2 = 8458: log likelihood = -103230.3
## AIC=206468.5 AICc=206468.5 BIC=206499.6
The auto.arima() function automatically selects the best ARIMA model order based on AIC (Akaike Information Criterion).
The ARIMA model it selected, ARIMA(2,1,1) indicates:
AR(2): Two autoregressive terms. I(1): The time series has been differenced one times. MA(1): One moving average terms.
#forecasting with the arima model
forecast_values <- forecast(arima_model, h = 31) # Forecasting 1 month ahead
print(forecast_values, "\n")
## Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
## 17380 50.11389 -67.7458 167.9736 -130.1369 230.3647
## 17381 54.68647 -146.7102 256.0831 -253.3231 362.6960
## 17382 60.30438 -188.0116 308.6203 -319.4621 440.0708
## 17383 61.34007 -215.4496 338.1297 -361.9732 484.6534
## 17384 59.97051 -240.8542 360.7952 -400.1012 520.0422
## 17385 59.01525 -266.1098 384.1403 -438.2206 556.2511
## 17386 59.03263 -290.1136 408.1788 -474.9404 593.0057
## 17387 59.35264 -312.2558 430.9610 -508.9734 627.6787
## 17388 59.49108 -332.8785 451.8606 -540.5864 659.5685
## 17389 59.44884 -352.4617 471.3593 -570.5139 689.4116
## 17390 59.38458 -371.2290 489.9982 -599.1821 717.9512
## 17391 59.36944 -389.2415 507.9804 -626.7219 745.4607
## 17392 59.38363 -406.5474 525.3147 -653.1964 771.9637
## 17393 59.39496 -423.2202 542.0101 -678.7012 797.4911
## 17394 59.39543 -439.3342 558.1251 -703.3458 822.1366
## 17395 59.39194 -454.9470 573.7309 -727.2217 846.0055
## 17396 59.39021 -470.1015 588.8819 -750.3974 869.1779
## 17397 59.39058 -484.8337 603.6149 -772.9287 891.7098
## 17398 59.39130 -499.1769 617.9595 -794.8650 913.6476
## 17399 59.39151 -513.1606 631.9437 -816.2514 935.0345
## 17400 59.39137 -526.8111 645.5938 -837.1279 955.9106
## 17401 59.39124 -540.1509 658.9334 -857.5294 976.3118
## 17402 59.39122 -553.2003 671.9828 -877.4867 996.2691
## 17403 59.39126 -565.9775 684.7600 -897.0277 1015.8102
## 17404 59.39128 -578.4987 697.2813 -916.1773 1034.9598
## 17405 59.39128 -590.7789 709.5614 -934.9581 1053.7407
## 17406 59.39127 -602.8313 721.6139 -953.3908 1072.1734
## 17407 59.39127 -614.6684 733.4509 -971.4940 1090.2765
## 17408 59.39127 -626.3011 745.0836 -989.2846 1108.0672
## 17409 59.39127 -637.7397 756.5222 -1006.7785 1125.5610
## 17410 59.39127 -648.9936 767.7761 -1023.9899 1142.7724
cat("Forcasted number of users that will register in the next 1 month: ", sum(forecast_values$mean))
## Forcasted number of users that will register in the next 1 month: 1829.943
plot(forecast_values, main = "Next 1 Month Forecast", ylim = c(0, 2000), xlim = c(16000, dim(data_time_smooth)[1] + 110), xlab = "Day", ylab = "No of registered users")
The forecasts are shown as a blue line, with the 80% prediction intervals as a dark ash shaded area, and the 95% prediction intervals as an ash shaded area.
Seasonality affects the rental demand of bikes
The Impacts of each seasons of the year on the rental demand of bikes are as follows:
Holidays has a negative effect on the rental demand of bikes, there is a decrease of demand during holidays.
The Fall(Autumn) season contributes the highest to the rental demands of bikes.
There is no effect of temperature to the demand of bikes.
The time series model forecasts that there would be a 0.06% increase for the rental demand of bikes within the next 1 month.